(54) Encoder for an end-to-end scalable video delivery system



(57) A software-based encoder is provided for an
end-to-end scalable video delivery system that operates
over heterogeneous networks. The encoder utilizes a
scalable video compression algorithm based on a
Laplacian pyramid decomposition to generate an
embedded information stream. The encoder decimates a
highest resolution original image, e.g., 640x480 pixels, to
produce an intermediate 320x240 pixel image that is
decimated to produce an intermediate 160x120 pixel
image that is compressed to form an encodable base layer
160x120 pixel image. This base layer image is
decompressed to form an image that is up-sampled by
interpolation to produce an up-sampled 320x240 pixel
image. This up-sampled image is subtracted from the
intermediate 320x240 pixel image to form an error image
that is compressed and encoded as a first enhancement
320x240 pixel layer. The decompressed base layer
image is also up-sampled to produce an up-sampled
640x480 pixel image that is subtracted from the
original 640x480 pixel image 200 to yield an error image
that is compressed to yield a second enhancement
640x480 pixel layer. Collectively, the base and
enhancement layers comprise the transmitted embedded bit
stream. At the receiving end, the decoder extracts from
the embedded stream different streams at different
spatial and temporal resolutions. Because decoding
requires only additions and look-ups from a small stored
table, decoding occurs in real-time.
Description

FIELD OF THE INVENTION

The present invention relates generally to video delivery systems, and more specifically to encoders for such
systems that permit video to be delivered scalably, so as to maximize use of network resources and to [...].

BACKGROUND OF THE INVENTION 

It is known in the art to use server-client networks to provide video to end users, wherein the server issues a 
separate video stream for each individual client. 

A library of video sources is maintained at the server end. Chosen video selections are signal processed by a
server encoder, stored on digital media, and are then transmitted over a variety of networks, perhaps on a basis that
permits a user to interact with the video. The video may be stored on media [...]. Such information can include
video, speech, and images. As such, the source video information may
have been stored in one of several spatial resolutions (e.g., 160x120, 320x240, 640x480 pixels) and temporal
resolutions (e.g., [...] frames per second). The source video may present bandwidths whose dynamic range [...].

Although the source video has varying characteristics, prior art video delivery systems operate with a system
bandwidth that is static or fixed. Although such system bandwidths are fixed, in practice the general purpose computing
environment associated with the systems is dynamic, and variations in the networks can also exist. These variations
can arise from the outright lack of resources (e.g., limited network bandwidth and processor cycles), contention for
available resources due to congestion, or a user's unwillingness to allocate needed resources to the task.

Prior art systems tend to be very computationally intensive, especially with respect to decoding images of differing
resolutions. For example, where a prior art encoder transmits a bit stream of, say, 320x240 pixel resolution but the
decoder requires 160x120 pixel resolution, several processes must be invoked, involving decompression, entropy
[...] cosine transformation [...].

Color conversions, e.g., YUV-to-RGB, are especially computationally intensive in the prior art. In another situation,
an encoder may transmit 24 bits, representing 16 million colors, but a recipient decoder may be coupled to a PC having
[...], which is a computationally intensive task.

Unfortunately, fixed bandwidth prior art systems cannot make full use of such dynamic environments and system
[...] for a level of expenditure for system
hardware and software. When congestion (e.g., a region of constrained bandwidth) is present on the network, packets
of transmitted information will be randomly dropped, with the result that no useful information may be received by the
[...].
Video information is extremely storage intensive, and compression is necessary during storage and transmission.
Although scalable compression would be beneficial, especially for browsing in multimedia video sources, existing
compression systems do not provide desired properties for scalable compression. By scalable compression it is meant that
a full dynamic range of spatial and temporal resolutions should be provided on a single embedded video stream that
is output by the server over the network(s). Acceptable software-based scalable techniques are not found in the prior
art. [...] compression standard offers limited extent scalability, but lacks sufficient dynamic range
[...] in software, and uses variable length codes that require additional error correction
support.

[...] compression standards typically require dedicated hardware at the encoding end, e.g., an MPEG
[...] MPEG compression standard. While some prior art encoding techniques are software-based and operate
[...] hardware (other than a fast central processing unit), known software-based approaches are too
computationally intensive to operate in real-time. For example, JPEG software running on a SparcStation 10 workstation
[...] about 1% of the frame/second capability of the present invention.
Considerable video server research in the prior art has focussed on scheduling policies for on-demand situations
[...]. Prior art encoder operation typically is dependent upon the
[...] client decoders. Simply stated, relatively little work has been directed to video server systems operable over
heterogeneous networks having differing bandwidth capabilities, where host decoders have various spatial and temporal
resolutions.

In summary, there is a need for a video delivery system that provides end-to-end video encoding such that the 
server outputs a single embedded data stream from which decoders may extract video having different spatial reso- 
lutions, temporal resolutions and data rates. The encoder should be software-based and provide video compression 
that is bandwidth scalable, and thus deliverable over heterogeneous networks whose transmission rates vary from 
perhaps 10 Kbps to 10 Mbps. Such a system should accommodate lower bandwidth links or congestion, and should 
permit the encoder to operate independently of decoder capability or requirements. 

The decoder for such system should be software-based (e.g., not require specialized dedicated hardware beyond 
a computing system) or should be implemented using inexpensive read-only memory type hardware, and should permit 
real-time decompression. The system should permit user selection of a delivery bandwidth to choose the most appro- 
priate point in spatial resolution, temporal resolution, data-rate and in quality space. The system should also provide 
subjective video quality enhancement, and should include error resilience to allow for communication errors. 

The present invention provides a software-based encoder for such a system. 

SUMMARY OF THE INVENTION 

The present invention provides a software-based server-encoder for an end-to-end scalable video delivery system,
wherein the server-encoder operates independently of the capabilities and requirements of the software-based
decoder(s). The encoder uses a scalable compression algorithm based upon Laplacian pyramid decomposition. An original
640x480 pixel image is decimated to produce a 320x240 pixel image that is itself decimated to yield a 160x120 pixel
base image that is encoder-transmitted.

This base image is then compressed to form a 160x120 pixel base layer that is decompressed and up-sampled
to produce an up-sampled 320x240 pixel image. The up-sampled 320x240 pixel image is then subtracted from the
320x240 pixel image to provide an error image that is compressed and transmitted as a first enhancement layer. The
160x120 pixel decompressed image is also up-sampled to produce an up-sampled 640x480 pixel image that is
subtracted from the original 640x480 pixel image to yield an error image that is compressed and transmitted as a second
enhancement layer.

Collectively, the base layer and the first and second enhancement layers comprise the single embedded bit stream
that may be multicast over heterogeneous networks that can range from telephone lines to wireless transmission.
Packets within the embedded bit stream preferably are prioritized with bits arranged in order of visual importance. The
resultant bit stream is easily rescaled by dropping less important bits, thus providing bandwidth scalability with a dynamic
range from a few Kbps to many Mbps. Further, such embedded bit stream permits the server system to accommodate
a plurality of users whose decoder systems have differing characteristics. The transmitting end also includes a
market-based mechanism for resolving conflicts in providing an end-to-end scalable video delivery service to the user.

At the receiving end, decoders of varying characteristics can extract different streams at different spatial and
temporal resolutions from the single embedded bit stream. Decoding a 160x120 pixel image involves only decompressing
the base layer 160x120 pixel image. Decoding a 320x240 pixel image involves decompressing and up-sampling the
base layer to yield a 320x240 pixel image to which is added error data in the first enhancement layer following its
decompression. To obtain a 640x480 pixel image, the decoder up-samples the up-sampled 320x240 pixel image, to
which is added error data in the second enhancement layer, following its decompression. Thus, decoding is fast and
requires only table look-ups and additions. Subjective quality of the compressed images is enhanced using perceptual
distortion measures. The system also provides joint-source channel coding capability on heterogeneous networks.

Other features and advantages of embodiments of the invention will appear from the following description in which 
the preferred embodiments have been set forth in detail, in conjunction with the accompanying drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIGURE 1 is a block diagram of an end-to-end scalable video system, in which the present invention may be 
embodied; 

FIGURE 2 is a block/flow diagram depicting a software-based encoder that generates a scalable embedded video 
stream, embodying the present invention; 

FIGURE 3 is a block/flow diagram depicting a decoder recovery of scalable video from a single embedded video 
stream. 
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Figure 1 depicts an end-to-end scalable video delivery system, including a software-based encoder embodying
the present invention. A source of audio and video information 10 is coupled to a server or encoder 20. The encoder
[...] software-based decoder systems 40, 40', which decoder uses minimal
[...] resources. Network transmission may be through a so-called network cloud 50, from which the
[...] information stream is multicast to the decoders, or transmission to the decoders 40' may be
point-to-point.

The networks are heterogeneous in that they have widely varying bandwidth characteristics, ranging from as low
as perhaps 10 Kbps for telephones, to 100 Mbps or more for ATM networks. As will be described, the single embedded
information stream is readily scaled, as needed, to accommodate a lower bandwidth network link or [...].

Server 20 includes a central processor unit ("CPU") with associated memory, collectively 55, a scalable video
encoder 60 according to the present invention, a mechanism 70 for synchronizing audio and video information [...],
and a mechanism 80 for arranging the information processed by the scalable video encoder onto video disks 90 (or other
storage media). Storage 100 is also provided for signal processed audio information. Software comprising the scalable
video encoder 60 preferably is digitally stored within server 20, for example, within the memory [...].

An admission control mechanism 110 is coupled to the processed video storage 90, as is a communication error
recovery mechanism 120 for handling bit errors or packet cell loss. The decoder algorithm provides [...]
communication errors [...] communicates with the [...].

The encoder preferably is implemented in software only (e.g., no
dedicated hardware), and generates a single embedded information stream. Encoder 60 employs a new video coding
algorithm based on a Laplacian pyramidal decomposition to generate the embedded information stream. [...]
embedded stream allows server 20 to host decoders 40, 40' having various spatial and temporal
resolutions without the server having to know the characteristics of the recipient decoder(s).

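With reference to Figure 2, an original 640x480 pixel image 200 from source 10 is coupled to the scalable video
encoder 60. At step 210, this image is decimated (e.g., filtered and sub-sampled) to produce 320x240 pixel image 220,
which is in turn decimated to produce the 160x120 pixel base image 240.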

For the 160x120 pixel base layer, encoding preferably is done on 2x2 blocks (e.g., two adjacent pixels on one
line, and two adjacent pixels on a next line defining the block) with DCT followed by
[...] transform. For the 320x240 first enhancement layer, encoding is done on 4x4 blocks
with DCT followed by TSVQ, and for the 640x480 second enhancement layer encoding is done on 8x8 blocks, again
with DCT followed by TSVQ.

At step 250, the 160x120 pixel base image 240 is compressed to form a 160x120 pixel base layer 260 [...].
Base layer 260 is decompressed to yield a 160x120 pixel decompressed image 280, which is up-sampled to produce
an up-sampled 320x240 pixel image 300.

At summation step 310, the up-sampled 320x240 pixel image 300 is subtracted from the 320x240 pixel image 220
to give an error image 320. The error image 320 is then compressed and transmitted as a first enhancement
320x240 pixel layer 340.

The 160x120 pixel decompressed image 280 is also up-sampled at step 350 to produce an up-sampled 640x480
pixel image 360. At summation step 370, the up-sampled 640x480 pixel image 360 is subtracted from the original
640x480 pixel image 200 to yield an error image. This error image is compressed to yield a second
enhancement 640x480 pixel layer 400 that is transmitted. Collectively, layers 260, 340 and 400 comprise the embedded
bit-stream generated by the scalable video encoder 60.
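By way of illustration only, the layering just described can be sketched as follows. This is a minimal sketch under stated assumptions and is not the encoder of the present embodiment: 2x2 averaging stands in for the decimation filter, pixel replication stands in for interpolation, and a uniform quantizer (`compress`, `decompress`) stands in for the DCT/TSVQ coder. All function names are hypothetical.

```python
# Sketch only: the filters and the quantizer below are assumed stand-ins.

def decimate(img):
    """Halve each dimension by averaging 2x2 neighbourhoods (assumed filter)."""
    h, w = len(img), len(img[0])
    return [[(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
             for c in range(0, w, 2)] for r in range(0, h, 2)]

def upsample(img, factor):
    """Enlarge by pixel replication (assumed form of interpolation)."""
    out = []
    for row in img:
        wide = [p for p in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def compress(img, step):
    """Stand-in lossy coder: uniform quantization to integer levels."""
    return [[int(round(p / step)) for p in row] for row in img]

def decompress(levels, step):
    """Inverse of the stand-in coder."""
    return [[q * step for q in row] for row in levels]

def subtract(a, b):
    """Pixel-wise difference of two equally sized images."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def encode(original, step=4.0):
    """Return (base layer, first enhancement, second enhancement) for one frame."""
    half = decimate(original)                      # image 220
    quarter = decimate(half)                       # base image 240
    base = compress(quarter, step)                 # base layer 260
    recon = decompress(base, step)                 # decompressed image 280
    err1 = subtract(half, upsample(recon, 2))      # error image 320
    err2 = subtract(original, upsample(recon, 4))  # second error image
    return base, compress(err1, step), compress(err2, step)
```

Note that both error images are formed against the decompressed base image rather than against the uncompressed one, so that the decoder, which only ever sees the decompressed base, can add the error data back consistently.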

Thus, it is appreciated from Figure 2 that a scalable video encoder 60 according to the present invention [...]
160x120 pixel base image 240. The first enhancement layer 340 has error data for the compressed 320x240 pixel
image 220, and the second enhancement layer 400 has error data for the compressed 640x480 pixel image 200.
The present embodiment uses vector quantization across transform bands to embed coding to provide bandwidth
scalability with an embedded bit stream. Vector quantization techniques are known in the art. See A.
Gersho and R. M. Gray, "Vector Quantization and Signal Compression", Kluwer Academic Press, 1992.

[...] and vector quantization may each be performed by tree-structured vector quantization methods
("TSVQ"), e.g., by a successive approximation version of vector quantization ("VQ"). In ordinary VQ, the codewords
lie in an unstructured codebook, and each input vector is mapped to the minimum distortion codeword.
Thus, VQ induces a partition of the input space into Voronoi encoding regions.

EP 0 739 140 A2

By contrast, when using TSVQ, the codewords are arranged in a tree structure, and each input vector is
successively mapped (from the root node) to the minimum distortion child node. As such, TSVQ induces a hierarchical
partition, or refinement, of the input space as the depth of the tree increases. Because of this successive refinement, an input
vector mapping to a leaf node can be represented with high precision by the path map from the root to the leaf, or with
lower precision by any prefix of the path.

Thus, TSVQ produces an embedded encoding of the data. If the depth of the tree is R and the vector dimension
is k, then bit rates 0/k, ..., R/k can all be achieved. To achieve further compression, the index-planes can be
run-length coded followed by entropy coding. Algorithms for designing TSVQs and their variants have been studied
extensively. The Gersho and Gray treatise cited above provides a background survey of such algorithms.
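The embedding property of TSVQ can be illustrated with a small sketch. It assumes a binary tree, plain squared error as the distortion, and a toy one-dimensional codebook; the `Node` class and function names are hypothetical and are not taken from the present embodiment.

```python
# Sketch only: toy binary TSVQ with unweighted squared error.

class Node:
    """A tree node holding a codeword and zero or more child nodes."""
    def __init__(self, codeword, children=()):
        self.codeword = codeword
        self.children = list(children)

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def tsvq_encode(root, vector):
    """Descend from the root, taking the minimum distortion child at each level."""
    path, node = [], root
    while node.children:
        best = min(range(len(node.children)),
                   key=lambda i: sq_dist(vector, node.children[i].codeword))
        path.append(best)
        node = node.children[best]
    return path

def tsvq_decode(root, path):
    """Follow a path, or any prefix of it, and return the codeword reached."""
    node = root
    for branch in path:
        node = node.children[branch]
    return node.codeword

# Depth R = 2, dimension k = 1.
tree = Node((8,), [
    Node((4,), [Node((2,)), Node((6,))]),
    Node((12,), [Node((10,)), Node((14,))]),
])
full_path = tsvq_encode(tree, (5,))
```

Here the input (5,) is coded with the two-bit path [0, 1]. Decoding the full path yields (6,), decoding only the first bit yields the coarser (4,), and decoding the empty prefix yields the root codeword (8,), which is the successive refinement described above.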

In the prior art, mean squared error typically is used as the distortion measure, with discrete cosine transforms ("DCT")
being followed by scalar quantization. By contrast, the present embodiment performs DCT, after which whole blocks
of data are subjected to vector quantization, preferably with a perception model.

Subjectively meaningful distortion measures are used in the design and operation of the TSVQ. For this purpose,
vector transformation is made using the DCT. Next, the following input-weighted squared error is applied to the
transform coefficients:

$$d(y, \hat{y}) = \sum_{j=1}^{K} w_j \, (y_j - \hat{y}_j)^2$$

In the above equation, $y_j$ and $\hat{y}_j$ are the components of the transformed vector $y$ and of the corresponding
reproduction vector $\hat{y}$, whereas $w_j$ is a component of the weight vector depending in general on $y$. Stated
differently, distortion is the weighted sum of squared differences between the coefficients of the original transformed
vector and the corresponding reproduced vector.
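As a hedged illustration, the weighted squared error and its use in choosing a codeword might be sketched as below. The weight values are arbitrary examples, and the function names are hypothetical; how the weights are derived from the input (the perception model) is not shown.

```python
# Sketch only: weights are illustrative, not derived from a perceptual model.

def weighted_sq_error(y, y_hat, w):
    """Sum over coefficients of w_j * (y_j - y_hat_j)^2."""
    return sum(wj * (yj - yh) ** 2 for wj, yj, yh in zip(w, y, y_hat))

def best_codeword(y, codebook, w):
    """Index of the codeword with minimum weighted distortion."""
    return min(range(len(codebook)),
               key=lambda i: weighted_sq_error(y, codebook[i], w))

y = [4.0, 2.0]
codebook = [[3.0, 0.0], [4.0, 5.0]]
choice_a = best_codeword(y, codebook, [2.0, 0.5])   # distortions 4.0 and 4.5
choice_b = best_codeword(y, codebook, [10.0, 0.1])  # distortions 10.4 and 0.9
```

The same input maps to codeword 0 under the first weight vector and to codeword 1 under the second, showing how the weights shift precision between coefficients.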

In the described arrangement, the weights reflect human visual sensitivity to quantization errors in different
transform coefficients, or bands. The weights are input-dependent to model masking effects. When used in the perceptual
distortion measure for vector quantization, the weights control an effective stepsize, or bit allocation, for each band.
When the transform coefficients are vector quantized with respect to a weighted squared error distortion measure, the
role played by weights $w_1, \ldots, w_K$ corresponds to stepsizes in the scalar quantization case. Thus, the perceptual model
is incorporated into the VQ distortion measure, rather than into a stepsize or bit allocation algorithm. This permits the
weights to vary with the input vector, while permitting the decoder to operate without requiring the encoder to transmit
any side information about the weights.

In the first stage of the compression encoder shown in Figure 2, an image is transformed using DCT. The second
stage of the encoder forms a vector of the transformed block. Next, the DCT coefficients are vector quantized using a
TSVQ designed with a perceptually meaningful distortion measure. The encoder sends the indices as an embedded
stream with different index planes. The first index plane contains the index for the rate 1/k TSVQ codebook. The second
index plane contains the additional index which, along with the first index plane, gives the index for the rate 2/k TSVQ
codebook. The remaining index planes similarly have part of the indices for 3/k, 4/k, ..., R/k TSVQ codebooks,
respectively.

Such encoding of the indices advantageously produces an embedded prioritized bitstream. Thus, rate or bandwidth
scalability is easily achieved by dropping index planes from the embedded bit-stream. At the receiving end, the decoder
can use the remaining embedded stream to index a TSVQ codebook of the corresponding rate.
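The index-plane arrangement can be sketched as follows, assuming each block's index is an R-bit integer whose most significant bit is the first branch of the TSVQ path. The function names are hypothetical.

```python
# Sketch only: indices are treated as plain R-bit integers, MSB first.

def to_index_planes(indices, depth):
    """Split depth-bit indices into planes; plane 0 holds the most significant bits."""
    return [[(i >> (depth - 1 - p)) & 1 for i in indices] for p in range(depth)]

def from_index_planes(planes):
    """Rebuild (possibly shortened) indices from the leading planes that were kept."""
    if not planes:
        return []
    indices = [0] * len(planes[0])
    for plane in planes:
        indices = [(i << 1) | b for i, b in zip(indices, plane)]
    return indices

planes = to_index_planes([5, 2, 7], depth=3)
full = from_index_planes(planes)        # all three planes kept
reduced = from_index_planes(planes[:2]) # least significant plane dropped
```

With the last plane dropped, the three-bit indices 5, 2, 7 become the two-bit indices 2, 1, 3, which address the rate 2/k codebook.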

Frame-rate scalability can be easily achieved by dropping frames, as at present no interframe compression is
implemented in the preferred embodiment of the encoder algorithm. The algorithm further provides a perceptually
prioritized bit-stream because of the embedding property of TSVQ. If desired, motion estimation and/or conditional
replenishment may also be incorporated into the system.

Scalable compression is also important for image browsing, multimedia applications, transcoding to different
formats, and embedded television standards. By prioritizing packets comprising the embedded stream, congestion due
to contention for network bandwidth, central processor unit ("CPU") cycles, etc., in the dynamic environment of general
purpose computing systems can be overcome by intelligently dropping less important packets from the transmitted
embedded stream.
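One simple way to picture such prioritized dropping is a per-frame byte budget that is filled in order of importance. This is a sketch under assumptions: the `(priority, size)` packet representation, the packet sizes, and the function name are all illustrative and are not part of the described system.

```python
# Sketch only: packet tuples and sizes are illustrative assumptions.

def fit_to_budget(packets, budget_bytes):
    """Keep the most important packets that fit; a lower priority number is more important."""
    kept, used = [], 0
    for priority, size in sorted(packets, key=lambda p: p[0]):
        if used + size > budget_bytes:
            break  # everything less important is dropped as well
        kept.append((priority, size))
        used += size
    return kept

frame_packets = [(3, 640), (0, 2440), (1, 640), (4, 640), (2, 640)]
sent = fit_to_budget(frame_packets, budget_bytes=4000)
```

Stopping at the first packet that does not fit, rather than skipping it, keeps the surviving packets a contiguous prefix in importance order, which is what an embedded stream needs.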

Information layout on the video disk storage system 90 (see Figure 1) preferably involves laying the video as two
streams, e.g., the base layer and the first and second enhancement layer streams. In practice, it is not necessary to
[...] enhancement layer data. As will now be described, the base layer data is stored [...]
a set of index planes corresponding to [...].

[...] group of index planes pre-formatted with application level headers for network transmission [...].
[...] indices are stored together as the first section of the [...] stored in sequence,
as separate sections of the frame block to provide look-up indices with 4, 5, 6, 7, 8 bits, respectively. The different
look-up indices provide data streams with different bandwidth requirements.

[...] index planes into look-up indices is left to the receiving application
at the client-end of the system.

[...] most significant bits of the look-up indices [...] are stored as the most
[...] block in the bit stream. Then follow the
second two bits of the look-up indices [...] four additional 1-bit sections [...] look-up
indices that are stored to [...] bits, respectively. Other encoding bit patterns
might instead be used, however.

[...] across several drives [...]
are quite rare compared to the number of times the streams are [...] of the parity drive [...] data updates [...]
look-up indices for an [...] positioning when a user single
steps or fast-forwards the user's display, although [...] of storage capacity due to fragmentation.
Use of parity on the stripe level allows [...] failure [...]
more buffer space to hold the full [...].

[...] format directly as the basis for the packet [...]
the network. [...] application packet header are read from disk 90 and are transmitted
on the network [...]. In the preferred embodiment, the base video layer has
the four most significant bits of the look-up indices [...] as one 2440 byte
packet, and each additional bit [...].

[...] uses these measures to schedule the next
disk read and packet transmission [...] X milliseconds in the future to
start transmitting the next frame. The server can moderate the transmission rate based on slow down/speed-up
feedback from the decoder.

The receiving end of the embodiment will now be described with reference to Figure 1. At the receiving end,
decoder(s) 40 include a central processing unit ("CPU") 140 [...] that includes a CPU per se and associated memory
including cache memory. Decoder(s) 40 further include [...].
[...] decoders is coupled to a sound generator, e.g., a speaker, and to a video display [...].

[...] for example [...]
are not required, for example [...]. [...] invention may be
implemented in hardware, e.g., in a simple CPU' and read-only memory unit 155. Within unit 155 is a relatively
simple central processor unit CPU' that, collectively with the associated ROM, represents a hardware unit that may be
produced for a few dollars.
Target decoder system 40 should be able to define at least 160x120, 320x240, 640x480 pixel spatial resolutions,
and at least 1 to 30 frames per second temporal resolution. Decoder system 40 must also accommodate bandwidth
scalability with a dynamic range of video data from 10 kbps to 10 Mbps. In this arrangement, video encoder
60 provides a single embedded stream from which different streams at different spatial and temporal resolutions and
different data rates can be extracted by decoders 40, depending on decoder capabilities and requirements. However,
as noted, encoder embedding is independent of the characteristics of the decoder(s) that will receive the single
embedded information stream.

For example, decoder 40 can include search engines that permit a user to browse material for relevant segments,
perhaps news, that the user may then select for full review. Within server 20, video storage 90 migrates the full
resolution, full frame rate news stories based on their age and access history from disk to CD ROM to tape, leaving lower
resolution versions behind to support the browsing operation. If a news segment becomes more popular or important,
the higher resolution can then be retrieved and stored at a more accessible portion of the storage hierarchy 90.

The decoder(s) may be software-based and merely use the indices from the embedded bit-stream to look up from
a codebook that is designed to make efficient use of the cache memory associated with the CPU unit 140. In this
arrangement, video stream decoding is straightforward, and consists of loading the codebooks into the CPU cache
memory, and performing look-ups from the stored codebook tables. In practice, the codebook may be stored in less
than about 12 Kb of cache memory.

Video decoder 160 may be software-based and uses a Laplacian pyramid decoding algorithm, and preferably can
support up to three spatial resolutions, i.e., 160x120 pixels, 320x240 pixels, and 640x480 pixels. Further, decoder 160
can support any frame rate, as the frames are coded independently by encoder 60.

The decoding methodology is shown in Figure 3. To decode a 160x120 pixel image, decoder 160 at method step
410 need only decompress the base layer 160x120 pixel image 260. The resultant image 430 is copied to video monitor
(or other device) 180. APPENDIX 1, attached hereto, is a sample of decompression as used with the present invention.

To obtain a 320x240 pixel image, decoder 160 first decompresses (step 410) the base layer 260, and then at step
440 up-samples to yield an image 450 having the correct spatial resolution, e.g., 320x240 pixels. Next, at step 460,
the error data in the first enhancement layer 340 is decompressed. The decompressed image 470 is then added at
step 480 to up-sampled base image 450. The resultant 320x240 pixel image 490 is coupled by decoder 160 to a suitable
display mechanism 180.

To obtain a 640x480 pixel image, the up-sampled 320x240 pixel image 450 is up-sampled at step 500 to yield an
image 510 having the correct spatial resolution, e.g., 640x480 pixels. Next, at step 520, the error data in the second
enhancement layer 400 is decompressed. The decompressed image 530 is added at step 540 to the up-sampled base
image 510. The resultant 640x480 pixel image 550 is coupled by decoder 160 to a suitable display mechanism 180.
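The three decoding paths of Figure 3 can be summarized in a short sketch. As assumptions, pixel replication stands in for the up-sampling interpolation and a uniform dequantizer stands in for the table look-up decompression; the function names and the `level` parameter are hypothetical.

```python
# Sketch only: replication up-sampling and uniform dequantization are stand-ins.

def upsample(img, factor):
    """Enlarge by pixel replication (assumed form of interpolation)."""
    out = []
    for row in img:
        wide = [p for p in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def decompress(levels, step):
    """Stand-in for look-up based decompression."""
    return [[q * step for q in row] for row in levels]

def add(a, b):
    """Pixel-wise sum of two equally sized images."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def decode(base, enh1, enh2, level, step=4.0):
    """level 0: base resolution; 1: twice the base; 2: four times the base."""
    small = decompress(base, step)             # step 410
    if level == 0:
        return small
    up2 = upsample(small, 2)                   # step 440, image 450
    if level == 1:
        return add(up2, decompress(enh1, step))    # steps 460 and 480
    up4 = upsample(up2, 2)                     # step 500, image 510
    return add(up4, decompress(enh2, step))        # steps 520 and 540
```

As in the text, the highest resolution is obtained by up-sampling the up-sampled base image 450 and adding only the second enhancement layer, so a decoder wanting that resolution does not need the first enhancement layer.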

As seen from Figure 3 and the above description, it will be appreciated that obtaining the base layer from the
embedded bit stream requires only look-ups, whereas obtaining the enhancement layers involves performing look-ups
of the base and error images, followed by an addition process. Preferably, the decoder is software-based and operates
rapidly in that all decoder operations are actually performed beforehand, i.e., by preprocessing. The TSVQ decoder
codebook contains the inverse DCT performed on the codewords of the encoder codebook. As noted, in applications
such as video displays where a complex CPU 140 would not necessarily be present, the video decoder may be
implemented in hardware, e.g., by storing the functions needed for decoding in ROM 155, or the equivalent. In practice,
ROM 155 may be as small as about 12 Kb.
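The preprocessing idea can be sketched for 2x2 blocks as follows. The sketch assumes an orthonormal 2x2 DCT with coefficients given in row-major order and a two-entry toy codebook; the function names are hypothetical.

```python
# Sketch only: orthonormal 2x2 DCT assumed; the codebook values are toy data.

def idct_2x2(coeffs):
    """Inverse orthonormal 2x2 DCT of (a, b, c, d) given in row-major order."""
    a, b, c, d = coeffs
    return ((a + b + c + d) / 2.0, (a - b + c - d) / 2.0,
            (a + b - c - d) / 2.0, (a - b - c + d) / 2.0)

def build_decoder_table(encoder_codebook):
    """Done once, offline: store pixel blocks instead of DCT codewords."""
    return [tuple(int(round(min(255.0, max(0.0, p)))) for p in idct_2x2(cw))
            for cw in encoder_codebook]

def decode_blocks(indices, table):
    """Run-time decoding is a table look-up per block, with no transform."""
    return [table[i] for i in indices]

table = build_decoder_table([(200, 0, 0, 0), (200, 40, 0, 0)])
blocks = decode_blocks([1, 0], table)
```

Because the inverse transform is folded into the table, the per-block run-time cost does not depend on the transform at all.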

Thus, at the decoder there is no need for performing inverse block transforms. Color conversion, i.e., YUV to RGB,
is also performed as a pre-processing step by storing the corresponding color converted codebook. To display video
on a limited color palette display, the resulting codewords of the decoder codebook are quantized using a color
quantization algorithm. One such algorithm has been proposed by applicant Chaddha et al., "Fast Vector Quantization
Algorithms for Color Palette Design Based on Human Vision Perception," accepted for publication IEEE Transactions
on Image Processing.

In this arrangement, color conversion involves forming an RGB or YUV color vector from the codebook codewords,
which are then color quantized to the required alphabet size. Thus, the same embedded index stream can be used
for displaying images on different alphabet decoders that have the appropriate codebooks with the correct alphabet
size, e.g., 1-bit to 24-bit color.
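A color converted codebook of this kind might be prepared as sketched below. Two assumptions are made that the text does not specify: the conversion uses the common full-range BT.601 (JFIF) coefficients, and simple bit truncation stands in for the color quantization algorithm cited above. The function names are hypothetical.

```python
# Sketch only: JFIF-style coefficients and bit truncation are assumptions.

def clamp8(x):
    """Round and limit a value to the 0..255 range."""
    return int(round(min(255.0, max(0.0, x))))

def yuv_to_rgb(y, u, v):
    """Full-range YUV (YCbCr) to RGB for one pixel."""
    r = y + 1.402 * (v - 128)
    g = y - 0.344136 * (u - 128) - 0.714136 * (v - 128)
    b = y + 1.772 * (u - 128)
    return clamp8(r), clamp8(g), clamp8(b)

def convert_codebook(yuv_codebook):
    """Done once, offline: each codeword is a tuple of (y, u, v) pixels."""
    return [tuple(yuv_to_rgb(*px) for px in cw) for cw in yuv_codebook]

def quantize_color(rgb, bits_per_channel):
    """Stand-in palette reduction: keep only the top bits of each channel."""
    shift = 8 - bits_per_channel
    return tuple((c >> shift) << shift for c in rgb)

rgb_codebook = convert_codebook([((128, 128, 128), (255, 128, 128))])
coarse = quantize_color((200, 100, 50), bits_per_channel=3)
```

A decoder for a given display depth would load the codebook already converted and quantized for that depth, so the received index stream itself is unchanged.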

On the receiving end, the video decoder 40, 40' is responsible for reassembly of the lookup indices from the packets 
received from the network. If one of the less significant index bit plane packets is somehow lost, the decoder uses the 
more significant bits to construct a shorter look-up table index. This yields a lower quality but still recognizable image. 
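The reassembly rule can be sketched as follows. It assumes each packet carries one index bit plane tagged with its plane number, plane 0 being the most significant, and that planes after a gap are not used because a TSVQ index is only meaningful as a path prefix. The function name and the dictionary representation are hypothetical.

```python
# Sketch only: one bit plane per packet, keyed by plane number (0 = MSB).

def reassemble(received, n_blocks):
    """Build look-up indices from the leading planes present before the first gap."""
    indices = [0] * n_blocks
    used = 0
    while used in received:
        indices = [(i << 1) | b for i, b in zip(indices, received[used])]
        used += 1
    return indices, used

# Plane 2 was lost in transit; plane 3 arrived but cannot be applied.
got = {0: [1, 0], 1: [0, 1], 3: [1, 1]}
indices, bits_used = reassemble(got, n_blocks=2)
```

The decoder then looks up the two-bit indices in the codebook of the matching rate, giving the lower quality but still recognizable image described above.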

The use of separately identified packets containing index bit planes makes it possible for networks to easily scale
the video as a side effect of dropping less important packets. In networks providing QOS qualifiers such as ATM,
multiple circuits can be used to indicate the order in which packets should be dropped (i.e., the least significant bit
plane packets first). In an IP router environment, packet filters can be constructed to appropriately discard less important
packets. For prioritized networks, the base layer will be sent on the high priority channel while the enhancement layer
[...]

The processing components include audio capture, video capture, video compression, and a data striping tool.
The video is captured and digitized using single step VCR devices. Each frame is then compressed off-line
(non-real time) using the encoding algorithm. At present, it takes about one second on a SparcStation 20 Workstation to
compress a frame of video data, and single step VCR devices can step at a one frame per second rate, permitting
overlap of capture and compression.

The audio data preferably is captured as a single pass over the tape. The audio and video time stamps and
sequence numbers are aligned by the data striping tool as the video is stored to facilitate later media synchronization.
The audio and video data preferably are striped onto the disks with a user-selected stripe size. In a preferred
embodiment, all of the video data on the server uses a 48 kilobyte stripe size, as 48 kilobytes per disk transfer provides good
utilization at peak load with approximately 50% of the disk bandwidth delivering data to the media server components.

The media server components include a session control agent, the audio transmission agent, and the video
transmission agent. The user connects to the session control agent on the server system and arranges to pay for the video
service and network bandwidth. The user can specify the cost he/she is willing to pay and an appropriately scaled
stream will be provided by the server. The session control agent (e.g., admission control mechanism 110) then sets
up the network delivery connections and starts the video and audio transmission agents. The session control agent
110 is the single point of entry for control operations from the consumer's remote control, the network management
system, and the electronic market.

The audio and video transmission agents read the media data from the striped disks and pace the transmission
of the data onto the network. The video transmission agent scales the embedded bit stream in real time by transmitting
only the bit planes needed to reconstruct the selected resolution at the decoder. For example, a 320x240 stream with
8 bits of base and 4 bits of enhancement signal at 15 frames per second will transmit every other frame of video data,
with all 5 packets for each frame of the base and only two packets containing the four most significant bits of the
enhancement layer, resulting in 864 Kbps of network utilization. The server sends the video and audio either for a
point-to-point situation or a multicast situation.
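The arithmetic behind the 864 figure can be sketched as follows. This is an illustration under stated assumptions, not the server's code: the constant of 4800 bits per index plane per frame is inferred from the bandwidth column of Table 1 (4 bits of lookup giving 19.2N Kbps), and the function names are hypothetical. The second function assumes that an 8-bit index is ordered with its most significant plane first, so that dropping low-order planes leaves a shorter but still usable index.

```c
/*
 * Bits per index plane per frame. Assumption: 4800 indices per frame per
 * layer, inferred from Table 1 (4 bits of lookup -> 19.2N Kbps).
 */
#define BITS_PER_PLANE 4800L

/* Network utilization in bits per second for the selected index planes. */
long stream_bits_per_second(int base_bits, int enh_bits, int frames_per_second)
{
    return (long)(base_bits + enh_bits) * BITS_PER_PLANE * frames_per_second;
}

/* Keep only the `keep` most significant bits (planes) of an 8-bit index. */
unsigned char truncate_index(unsigned char index, int keep)
{
    return (unsigned char)(index >> (8 - keep));
}
```

With 8 base bits, 4 enhancement bits, and 15 frames per second, `stream_bits_per_second(8, 4, 15)` evaluates to 864000 bits per second, which matches the 864 figure in the example above.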

The media player components are the software-based video decoder 40, 40', the audio receiver, and a user interface
agent. The decoder receives the data from the network, decodes it using look-up tables, and places the results
onto the frame buffer. The decoder can run on any modern microprocessor unit without loading the CPU significantly.
The audio receiver loops, reading data from the network and queuing up the data for output to the speaker. In the
event of audio packet loss, the audio receiver will ramp the audio level down to silence level and then back up to the
nominal audio level of the next successfully received audio packet. The system performs media synchronization to
align the audio and video streams at the destination, using techniques such as described by J. D. Northcutt and E. M.
Kuerner, "System Support for Time-Critical Applications," Proc. NOSSDAV 91, Germany, pp. 242-254.
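The ramp applied on audio packet loss might look like the following sketch. The text does not give the ramp shape or the sample format, so a linear gain ramp over 16-bit samples is assumed here, and the function name `audio_ramp` is hypothetical.

```c
#include <stddef.h>

/*
 * Scale nsamples 16-bit samples in place with a gain that moves linearly
 * from gain_start to gain_end (1.0 = nominal level, 0.0 = silence).
 * Assumptions: linear ramp, signed 16-bit PCM samples.
 */
void audio_ramp(short *samples, size_t nsamples, double gain_start, double gain_end)
{
    size_t i;

    for (i = 0; i < nsamples; ++i) {
        /* t runs from 0.0 at the first sample to 1.0 at the last one */
        double t = (nsamples > 1) ? (double)i / (double)(nsamples - 1) : 1.0;
        double gain = gain_start + (gain_end - gain_start) * t;
        samples[i] = (short)(samples[i] * gain);
    }
}
```

A receiver could call `audio_ramp(buf, n, 1.0, 0.0)` on the last good buffer before a gap and `audio_ramp(buf, n, 0.0, 1.0)` on the first buffer received after it, which avoids an abrupt step in the output level.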

End-to-end feedback is used in the on-demand case to control the flow. In the multicast case, the destinations are
slaved to the flow from the server with no feedback. The user interface agent serves as the control connection to the
session agent on the media server, passing flow control feedback as well as the user's start/stop controls. The user
can specify the cost he or she is willing to pay and an appropriate stream will be provided by the system.

A prototype system embodying the present invention uses a video data rate that varies from 19.2 kbps to 2 Mbps,
depending on the spatial and temporal requirements of the decoder and the network bandwidth available. The PSNR
varies between 31.63 dB and 37.5 dB. Table 1 gives the results for the decoding of a 160x120 resolution video on a
SparcStation 20. It can be seen from Table 1 that the time required to get the highest quality stream (8-bit index) at
160x120 resolution is 2.45 ms per frame (sum of lookup and packing time). This corresponds to a potential frame rate
of approximately 400 frames/sec.
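The relation between per-frame decode time and potential frame rate is a simple reciprocal, sketched here with a hypothetical helper named `potential_fps`. It assumes that lookup and packing are the only per-frame costs, as the text does when it sums the two times.

```c
/* Potential decode rate in frames per second from per-frame times in ms. */
double potential_fps(double lookup_ms, double packing_ms)
{
    return 1000.0 / (lookup_ms + packing_ms);
}
```

For the 8-bit row of Table 1 (1.18 ms of lookup and 1.27 ms of packing), this gives roughly 408 frames per second, which the text rounds to 400.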



TABLE 1. RESULTS FOR 160x120 RESOLUTION (DECODER)

No. of Bits   PSNR (dB)   Bandwidth as a function    CPU time per   Packing time per
of Lookup                 of frame rate (N), Kbps    frame (ms)     frame (ms)

4             31.63       19.2N                      1.24           0
5             32.50       24N                        1.32           0.52
6             34          28.8N                      1.26           0.80
7             35.8        33.6N                      1.10           1.09
8             37.2        38.4N                      1.18           1.27

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TABLE 2. 



RESULTS FOR 320x240 RESOLUTION (8-BIT LOOKUP BASE)

No. of Bits   PSNR (dB)   Bandwidth as a function    CPU time per   Packing time per
of Lookup                 of frame rate (N), Kbps    frame (ms)     frame (ms)

[illegible]   33.72       48N                        6.01           0.385
[illegible]   35.0        52.8N                      6.04           0.645
[illegible]   35.65       62.4N                      6.05           0.92
6             36.26       67.2N                      6.08           1.20
7             36.9        72N                        6.04           1.48
8             37.5        76.8N                      6.09           1.67



Table 3 gives the results for the decoding of a 640x480 resolution video. It can be seen from Table 3 that the time
required to get the highest quality stream at 640x480 resolution is 24.62 ms per frame (sum of lookup and packing
time). This corresponds to a potential frame rate of 40 frames/sec.



TABLE 3. RESULTS FOR 640x480 WITH 320x240 INTERPOLATED

No. of Bits   PSNR (dB)   Bandwidth as a function    CPU time per   Packing time per
of Lookup                 of frame rate (N), Kbps    frame (ms)     frame (ms)

2             33.2        48N                        22.8           0.385
4             34          52.8N                      22.87          0.645
5             34.34       62.4N                      23.14          0.92
6             34.71       67.2N                      22.93          1.20
7             35.07       72N                        22.90          1.48
8             35.34       76.8N                      22.95          1.67
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TABLE 4. RESULTS FOR 160x120 AT THE DISK SERVER 



No. of Bits   Bandwidth as a function    CPU time per   Seek time   Avg. CPU
of Lookup     of frame rate (N), Kbps    frame (ms)     (ms)        Load

4             19.2N                      2.84           16          1%
5             24N                        3.67           16          1%
6             28.8N                      4.48           14          2%
7             33.6N                      4.92           14          2%
8             38.4N                      5.60           16          2%


Similarly, Table 5 shows the results for each individual disk for 320x240 resolution video. It can be seen that
obtaining the highest quality stream (8-bit base and 8-bit enhancement layer) at 320x240 requires 12.73 ms of CPU
time and an average CPU load of 7% on a SparcStation 20 workstation. The average disk access time per frame is
18 ms.



TABLE 5. RESULTS FOR 320x240 AT THE DISK SERVER

No. of Bits   Bandwidth as a function    CPU time per   Seek time   Avg. CPU
of Lookup     of frame rate (N), Kbps    frame (ms)     (ms)        Load

2             48N                        10.47          18          6%
4             52.8N                      11.02          16          6%
5             62.4N                      11.55          18          6%
6             67.2N                      12.29          20          7%
7             72N                        12.55          20          7%
8             76.8N                      12.73          18          7%

/*

 * Scalable Video Displayer - Main Program
 * Copyright 1995 Sun Microsystems, Inc.
 */

typedef unsigned char pixel;    /* assumed 8-bit pixel; the type is not defined in this listing */
typedef signed int errorcode;

pixel *baselookupcodebook;
errorcode *errorlookupcodebook;
errorcode *largecodebook;

/*
 * invert the single eight bit base index into the base 2x2 codebook
 * and store into the destination image
 */
static void
BasePixelInvert(pixel *dest, int stride, int index)
{
    index <<= 2;    /* index into the 4 pixel chunk */

    /* the four stores are reconstructed from the way ErrorInvert reads the
     * base codebook; the body is truncated in the source listing */
    *(dest) = baselookupcodebook[index];
    *(dest + 1) = baselookupcodebook[index + 1];
    *(dest + stride) = baselookupcodebook[index + 2];
    *(dest + stride + 1) = baselookupcodebook[index + 3];
}

/*
 * step through the input indices and
 * invert each eight bit index into the base 2x2 codebook
 * store into the destination image at the correct location
 * each input index gives 4 destination pixels
 */
void
BaseInvert(int nv, int nh, pixel *image, int stride, unsigned char *source)
{
    int x, y;
    int index;
    unsigned char *dest;

    for (y = nv; y > 0; --y) {
        dest = image;
        for (x = nh; x > 0; --x) {
            index = *source++;
            BasePixelInvert(dest, stride, index);
            dest += 2;          /* step to the next destination pixel */
        }
        image += 2 * stride;    /* step to the start of the next two lines */
    }
}

static void
ErrorPixelInvert(pixel *dest, int stride, pixel base_pixel, int error_index)
{
    /* each destination base pixel gets combined with four error values */
    *(dest) = base_pixel + errorlookupcodebook[error_index];
    *(dest + 1) = base_pixel + errorlookupcodebook[error_index + 1];
    *(dest + stride) = base_pixel + errorlookupcodebook[error_index + 4];
    *(dest + stride + 1) = base_pixel + errorlookupcodebook[error_index + 5];
}

/*
 * step through the input indices and
 * invert each eight bit base index into the base 2x2 codebook
 * and invert each eight bit error index into the error 4x4 codebook
 * combine the results and then store into the destination image
 * at the correct location
 * each pair of input indices gives 16 destination pixels
 */
void
ErrorInvert(int nv, int nh, pixel *image, int stride,
            unsigned char *source, unsigned char *errorband)
{
    unsigned char *line1, *line3;
    int base_index, error_index;
    pixel base_pixel;
    int x;

    while (nv > 0) {
        --nv;
        line1 = image;
        line3 = line1 + (2 * stride);

        for (x = nh; x > 0; --x) {
            /* index into 4 pixel chunk for base 2x2 */
            base_index = *source++;
            base_index <<= 2;

            /* index into the 16 pixel chunk error 4x4 */
            error_index = *errorband++;
            error_index <<= 4;

            base_pixel = baselookupcodebook[base_index + 0];
            ErrorPixelInvert(line1, stride, base_pixel, error_index);
            line1 += 2;
            error_index += 2;

            base_pixel = baselookupcodebook[base_index + 1];
            ErrorPixelInvert(line1, stride, base_pixel, error_index);
            line1 += 2;
            error_index += 6;

            base_pixel = baselookupcodebook[base_index + 2];
            ErrorPixelInvert(line3, stride, base_pixel, error_index);
            line3 += 2;
            error_index += 2;

            base_pixel = baselookupcodebook[base_index + 3];
            ErrorPixelInvert(line3, stride, base_pixel, error_index);
            line3 += 2;
        }
        image += 4 * stride;    /* step to next set of four lines */
    }
}

void
display_agent()
{
    int nbase;
    int nerror;
    unsigned char *basedata;
    unsigned char *errordata;

    while (global_status != EXIT) {
        nbase = AssembleBasePacketsFromNetwork(&basedata);
        SetBaseCodeBook(nbase);

        nerror = AssembleErrorPacketsFromNetwork(&errordata);
        SetErrorCodeBook(nerror);

        if (large_display)
            ErrorInvert(120, 160, (pixel *) Ximage->data, ximage_width, basedata, errordata);
        else
            BaseInvert(120, 160, (pixel *) Ximage->data, ximage_width, basedata);

        /* use standard X11 put image to display the result */
        XPutImage(display, xid, gc, Ximage, 0, 0, 0, 0, ximage_width, ximage_height);
    }
}
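The lookup-only decoding used by the listing can be tried in isolation with the following sketch. It is not part of the original program: the two-entry codebook, its pixel values, and the name `demo_base_invert` are hypothetical, and 8-bit pixels are assumed. Each input index selects one 2x2 block, so the output image is twice the width and height of the index array.

```c
typedef unsigned char pixel;    /* assumption: 8-bit pixels */

/* Hypothetical two-entry codebook of 2x2 blocks, stored row-major. */
static const pixel demo_codebook[2 * 4] = {
    10, 20, 30, 40,     /* index 0 */
    50, 60, 70, 80      /* index 1 */
};

/* Expand nv rows of nh indices into a (2*nh) x (2*nv) image by lookup only. */
void demo_base_invert(int nv, int nh, pixel *image, int stride,
                      const unsigned char *source)
{
    int x, y;

    for (y = 0; y < nv; ++y) {
        pixel *dest = image + 2 * y * stride;   /* top row of this block row */
        for (x = 0; x < nh; ++x) {
            /* four codebook entries per index, hence the shift by 2 */
            const pixel *block = &demo_codebook[*source++ << 2];
            dest[0] = block[0];
            dest[1] = block[1];
            dest[stride] = block[2];
            dest[stride + 1] = block[3];
            dest += 2;                          /* next 2x2 block */
        }
    }
}
```

Decoding the index pair {1, 0} into a 4x2 image with a stride of 4 fills the top row with 50, 60, 10, 20 and the bottom row with 70, 80, 30, 40. No multiplications or transforms are involved, which is why the decoder's per-frame cost stays small.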
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Claims 

1. For use in a video delivery system server having a source of video images, an encoder providing an embedded 
bit stream containing information including image data at at least two spatial resolutions, said embedded bit stream 
being sent over at least one network to at least one decoder, the encoder including: 

a central processor unit coupled to a memory unit; 

encoding means, digitally stored in said memory unit and coupled to said source of video images and receiving 
a first image at first spatial resolution, for decimating said first image to form a first intermediate image at half
said highest resolution, for decimating said first intermediate image to form a second intermediate image, for 
compressing said second intermediate image to form a base layer image whose resolution is less than said 
first image; 

said encoding means further decompressing said base layer image to form a third intermediate image, and 
interpolating said third intermediate image to form a fourth intermediate image, and for subtracting said fourth 
intermediate image from said first intermediate image to form a fifth intermediate image, and for compressing 
said fifth intermediate image to form a first enhancement layer image whose resolution is less than said first 
image but greater than said base layer image; 

said embedded bit stream containing at least said base layer image and said first enhancement layer image. 

2. The encoder of claim 1, wherein said embedded bit stream contains at least three spatial resolutions including an
additional image, whose resolution equals that of said first image, and said base layer image, and wherein said
encoding means further interpolates said fourth intermediate image to form a sixth intermediate image that is
subtracted from said first image to form a seventh intermediate image that is compressed to form a second
enhancement layer image whose resolution equals that of said first image;

said embedded bit stream further containing said second enhancement layer image. 

3. The encoder of claim 1, wherein said embedded bit stream includes spatial resolution data encoded in pixel blocks
and wherein said encoding means encodes said spatial resolution data using a discrete cosine transformation 
followed by a tree-structured vector quantization upon results of said transformation. 

4. The encoder of claim 2, wherein said first image has a resolution of 640x480 pixels, said additional image has a 
resolution of 320x240 pixels, and said base layer image has a resolution of 160x120 pixels. 

5. The encoder of claim 4, wherein said embedded bit stream includes spatial resolution data encoded in pixel blocks
of size 2x2 bits for said base layer image, of size 4x4 bits for said additional image, and of size 8x8 bits for said
first image, and wherein said encoding means encodes said spatial resolution data using a discrete cosine
transformation followed by a tree-structured vector quantization upon results of said transformation.

6. The encoder of claim 3, wherein transform coefficients include input-weighted squared error defined as follows:

$$d(y, \hat{y}) = \sum_{j=1}^{K} w_j \, (y_j - \hat{y}_j)^2$$

where $y_j$ and $\hat{y}_j$ are components of a transformed vector $y$ and of a corresponding reproduction vector
$\hat{y}$, and where $w_j$ is a component of a weight vector generally dependent only upon $y$.

7. The encoder of claim 6, wherein said tree-structured vector quantization includes a perception model. 

8. The encoder of claim 7, wherein said weight vector components reflect human visual sensitivity to quantization 
errors in different transform coefficients. 

9. The encoder of claim 7, wherein said tree-structured vector quantization has a tree depth R and has a vector
dimension k, and wherein bitstream bit rates 0/k, ..., R/k are provided.

10. The encoder of claim 7, wherein indices are transmitted in said embedded stream with different index planes;

a first index plane containing a first index for a rate 1/k tree-structured vector quantization reference, and
a second index plane containing a second index for a rate 2/k tree-structured vector quantization reference.

11. The encoder of claim 10, wherein said embedded bit stream includes data packets, and wherein said indices are 
associated with a relative priority of importance of at least some of said data packets. 

12. The encoder of claim 11, wherein said embedded bit stream is sent over a network having throughput insufficient
for decoder receipt of a highest spatial image,

wherein said relative priority associated with said data packets permits selective non-transmission of
relatively unimportant packets.

13. The encoder of claim 11, wherein data for each video frame is stored together, and wherein each frame has an
associated set of index planes with associated packet headers. 

14. The encoder of claim 1, wherein said encoding means further includes at least one option selected from the group
consisting of (a) estimation of motion associated with a transmitted image, and (b) conditional replenishment of a 
said image. 

15. A method, for use in a video delivery system server having a source of video images, of encoding an embedded 
bit stream containing information including image data at at least two spatial resolutions, said embedded bit stream 
being sent over at least one network to at least one decoder, the method including the following steps: 

(a) providing a central processor unit coupled to a memory unit; and 

(b) providing encoding means, digitally stored in said memory unit and coupled to said source of video images 
and receiving a first image at first spatial resolution, for decimating said first image to form a first intermediate
image at half said highest resolution, for decimating said first intermediate image to form a second intermediate 
image, for compressing said second intermediate image to form a base layer image whose resolution is less 
than said first image; 

said encoding means further decompressing said base layer image to form a third intermediate image, and 
interpolating said third intermediate image to form a fourth intermediate image, and for subtracting said fourth 
intermediate image from said first intermediate image to form a fifth intermediate image, and for compressing 
said fifth intermediate image to form a first enhancement layer image whose resolution is less than said first 
image but greater than said base layer image; 

said embedded bit stream containing at least said base layer image and said first enhancement layer image. 

16. The method of claim 15, wherein said embedded bit stream contains at least three spatial resolutions including 
an additional image, whose resolution equals that of said first image, and said base layer image, and wherein at 
step (b), said encoding means further interpolates said fourth intermediate image to form a sixth intermediate 
image that is subtracted from said first image to form a seventh intermediate image that is compressed to form a 
second enhancement layer image whose resolution equals that of said first image; 

said embedded bit stream further containing said second enhancement layer image. 

17. The method of claim 15, wherein said embedded bit stream includes spatial resolution data encoded in pixel blocks,
and wherein at step (b), said encoding means encodes said spatial resolution data using a discrete cosine
transformation followed by a tree-structured vector quantization upon results of said transformation.

18. The method of claim 17, wherein at step (b), transform coefficients include input-weighted squared error defined
as follows:

$$d(y, \hat{y}) = \sum_{j=1}^{K} w_j \, (y_j - \hat{y}_j)^2$$

where $y_j$ and $\hat{y}_j$ are components of a transformed vector $y$ and of a corresponding reproduction vector
$\hat{y}$, and where $w_j$ is a component of a weight vector generally dependent only upon $y$.

19. The method of claim 18, wherein step (b) includes providing said weight vector components that reflect human 
visual sensitivity to quantization errors in different transform coefficients. 

20. The method of claim 18, wherein at step (b), said tree-structured vector quantization has a tree depth R and has
a vector dimension k, and wherein bitstream bit rates 0/k, ..., R/k are provided.

21. The method of claim 18, wherein said embedded bit stream includes data packets, and wherein at step (b), indices
are transmitted in said embedded stream with different index planes;

a first index plane containing a first index for a rate 1/k tree-structured vector quantization reference, and
a second index plane containing a second index for a rate 2/k tree-structured vector quantization reference;
said indices being associated with a relative priority of importance of at least some of said data packets.
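The layering recited in claims 1 and 15 (decimate, reconstruct the lower layer, interpolate, subtract) can be illustrated with a minimal sketch. Several simplifications are assumed and are not taken from the claims: decimation is done by averaging 2x2 blocks, interpolation is done by pixel replication, and the compression and decompression steps are omitted, so the lower layer is used directly. All function names are hypothetical.

```c
/* Decimate a w x h 8-bit image by 2 in each dimension by averaging 2x2 blocks. */
void decimate2(const unsigned char *src, int w, int h, unsigned char *dst)
{
    int x, y;

    for (y = 0; y < h / 2; ++y)
        for (x = 0; x < w / 2; ++x) {
            const unsigned char *p = src + 2 * y * w + 2 * x;
            /* rounded mean of the four source pixels */
            dst[y * (w / 2) + x] =
                (unsigned char)((p[0] + p[1] + p[w] + p[w + 1] + 2) / 4);
        }
}

/* Interpolate a w x h image up to 2w x 2h by pixel replication. */
void upsample2(const unsigned char *src, int w, int h, unsigned char *dst)
{
    int x, y;

    for (y = 0; y < 2 * h; ++y)
        for (x = 0; x < 2 * w; ++x)
            dst[y * 2 * w + x] = src[(y / 2) * w + (x / 2)];
}

/* Error image: higher-resolution image minus the up-sampled lower layer. */
void residual(const unsigned char *full, const unsigned char *up, int n, short *err)
{
    int i;

    for (i = 0; i < n; ++i)
        err[i] = (short)(full[i] - up[i]);
}
```

In this simplified form, adding the error image back to the up-sampled lower layer reproduces the higher-resolution image exactly. In the claimed encoder the lower layer passes through compression and decompression before interpolation, so the error image also carries the compression error of that layer.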